Weird variability observations?¶

Original message from Dan K (2023-11-02)¶

We noticed that for several of the UMAP traits the variance between the replicates of each genotype (after summarizing pseudoreplicates from individual seed data) was unstable across generations. That is, replicates seem to be much more variable within the parent genotypes than replicates of the progeny genotypes. This is pretty unusual. It is expected that variance across the parents be higher than the progeny, but not within parents. I am wondering whether there is a statistical/mathematical reason for this before we start to think of biological/experimental explanations.
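To make sure we are talking about the same quantity, here is a minimal sketch of how I read the computation described in the message: average the seeds (pseudoreplicates) within each replicate, then take the variance of those replicate means within each genotype, and finally compare generations. The data is synthetic and the column names (`generation`, `genotype`, `replicate`, `umap_1`) are hypothetical, not the ones in the actual files.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Hypothetical long-format table: one row per seed (synthetic values).
rows = []
for generation in ("parent", "progeny"):
    for g in range(4):                      # 4 genotypes per generation
        for r in range(3):                  # 3 replicates per genotype
            seeds = rng.normal(loc=g, scale=1.0, size=20)   # 20 seeds each
            rows.extend((generation, f"{generation}_{g}", r, v) for v in seeds)
seeds_df = pd.DataFrame(rows, columns=["generation", "genotype", "replicate", "umap_1"])

# Step 1: summarize pseudoreplicates, one mean per replicate.
rep_means = (seeds_df
             .groupby(["generation", "genotype", "replicate"], as_index=False)["umap_1"]
             .mean())

# Step 2: variance between replicate means within each genotype.
within_var = (rep_means
              .groupby(["generation", "genotype"])["umap_1"]
              .var(ddof=1)
              .rename("var_between_replicates")
              .reset_index())

# Step 3: typical within-genotype variance per generation.
summary = within_var.groupby("generation")["var_between_replicates"].median()
print(summary)
```

If the actual computation differs (for instance, medians instead of means in step 1), the same structure applies with the aggregation swapped.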


Quick recap¶

  • We will focus on the collection of descriptors that offered the highest classification results
    • 23 combined descriptors total: 11 traditional + 12 topological
    • For the topological descriptors, we used
      • 158 directions
      • 16 thresholds
      • Reduced a 2500-dimensional ECT vector to only 12 dimensions with UMAP
    • Traditional data was scaled
    • UMAP data was not scaled
  • All the data used is exactly the same as what was sent in the progeny zip file back on 20-something February 2022 (I remember the date because I was in Georgia for NAPPN at the time. Time flies.)
  • Once we have 23 descriptors per seed, we can do some machine learning

Things to remember¶

  • For the next three bullet points, you can substitute the word UMAP with PCA and the same argument holds.
  • UMAP dimension reduction of the ECT was semi-supervised.
  • We did not reduce the dimension of the ~38,000 seeds at once.
  • Instead, first we used UMAP to reduce the dimension of just the ~3100 seeds of the parents.
  • Based on that, UMAP "learned" a model to reduce ECT dimension of seeds (think of finding the 12 PC vectors in PCA).
  • With this learned model, we reduced the dimension of the seeds from $F_{18}$ and $F_{58}$.

  • Similarly, when it came to scaling the traditional data, we first scaled just the traditional descriptors from the ~3100 parent seeds.
  • And based on that, we scaled the $F_{18}$ and $F_{58}$ traditional descriptors.
  • (In other words, instead of computing the mean and variance of the 38,000 seeds, we computed the mean and variance of the parents and used those same values to scale the progeny.)
  • This semi-supervised setup was used so that we could perform semi-supervised learning later (explained below).
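The scaling scheme above can be sketched with plain NumPy. The numbers are synthetic stand-ins for a single traditional descriptor, chosen only to show the mechanics: when the mean and standard deviation come from the parents alone, the scaled parents have mean 0 and variance 1 by construction, while the scaled progeny generally do not. This is the same arithmetic as `StandardScaler().fit(parents).transform(progeny)`.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic stand-ins for one traditional descriptor (values are made up).
parents = rng.normal(loc=10.0, scale=2.0, size=(3000, 1))
progeny = rng.normal(loc=11.0, scale=1.0, size=(30000, 1))

# Fit the scaling on the parents only.
mu = parents.mean(axis=0)
sigma = parents.std(axis=0)        # population std (ddof=0), as StandardScaler uses

# Apply the same parent-derived values to both groups.
parents_scaled = (parents - mu) / sigma
progeny_scaled = (progeny - mu) / sigma

print("parents: mean={:.2f}, std={:.2f}".format(parents_scaled.mean(), parents_scaled.std()))
print("progeny: mean={:.2f}, std={:.2f}".format(progeny_scaled.mean(), progeny_scaled.std()))
```

So any comparison of spread between parents and progeny is made in units defined by the parents.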

First, supervised learning (what is already published back in 2021)¶

  • We are limited to only fully labeled data
  • In this case, the ~3,100 seeds from the parents where we have 28 different labels
  • The whole data set is split 75/25 into training/testing
  • First, 75% of the labeled data is handed to the machine so it can figure out a pattern to characterize each label based on morphological information alone.
  • We asked the computer to figure out these patterns by using SVM (support vector machine), a very well-known, deterministic technique in ML.
  • Once the machine has "learned" the patterns, we ask the computer about the other 25% that we withheld.
  • The machine gives a response, which is later compared to the ground truth (we can do this because all the data is labeled, remember).
  • We showed that by giving the machine the 23 traditional+topological descriptors, it could make the correct prediction 85% of the time.
  • We established with the previous steps that Traditional+ECT+UMAP combined shape descriptors provide an accurate description of seed morphology.
  • These descriptors balance both spike- and accession-level morphological nuances.
    • Traditional shape descriptors do a good job at capturing accession-wide features
    • Topological shape descriptors are good at the spike-level features
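As a reminder of what the supervised workflow looks like in code, here is a minimal sketch on synthetic data with 28 labels and 23 descriptors per sample. The stratified split, the RBF kernel, and the hyperparameters are my assumptions for illustration; this is not the exact configuration behind the published numbers.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic stand-in: 28 labels, 23 descriptors per sample (not the real seed data).
X, y = make_blobs(n_samples=2800, centers=28, n_features=23,
                  cluster_std=3.0, random_state=0)

# 75/25 split, stratified so every label appears in both parts (assumption).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)

# Kernel and hyperparameters are assumptions for this sketch.
clf = SVC(kernel='rbf', C=1.0, gamma='scale').fit(X_train, y_train)

# Compare predictions against the withheld ground truth.
accuracy = np.mean(clf.predict(X_test) == y_test)
print('Test accuracy: {:.1f}%'.format(100 * accuracy))
```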


Second, semi-supervised learning (where we are now)¶

  • Only part of the data is labeled.
  • In this case, we don't have labels for the $F_{18}$ and $F_{58}$ seeds
  • We simply use 100% of the labeled data (the parents) to let the SVM again figure out the patterns that characterize each of the 28 labels.
  • Then the unlabeled data is thrown into the mix: we get an answer and we run with it.

Things to note¶

  • SVM has to assign a label to each seed: it has no way to create an "Undetermined" category
  • There are ways to roughly estimate whether a label assignment was clearly the best option or merely the least-worst option among the 28 possibilities.
  • However, I didn't go that way.
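For reference, one rough way to do that estimate is sketched below. It was not used for any result in this report. The SVM is trained on all the labeled data, the unlabeled data is predicted, and the gap between the best and second-best one-vs-rest decision scores serves as a crude confidence measure. The data is synthetic, the 10% cutoff is arbitrary, and these scores are not calibrated probabilities.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Synthetic stand-ins (not the real seed data): "labeled" plays the parents,
# "unlabeled" plays the progeny.
X, y = make_blobs(n_samples=1200, centers=6, n_features=5,
                  cluster_std=2.5, random_state=0)
X_labeled, y_labeled = X[:900], y[:900]
X_unlabeled = X[900:]

# Train on 100% of the labeled data; 'ovr' gives one score column per label.
clf = SVC(kernel='rbf', decision_function_shape='ovr').fit(X_labeled, y_labeled)

predicted = clf.predict(X_unlabeled)
scores = clf.decision_function(X_unlabeled)      # shape: (n_samples, n_labels)

# Margin = best score minus runner-up score.
# A small margin suggests a "least-worst" rather than a clear-cut assignment.
top_two = np.sort(scores, axis=1)[:, -2:]
margin = top_two[:, 1] - top_two[:, 0]

# Arbitrary cutoff for illustration: flag the 10% smallest margins.
cutoff = np.quantile(margin, 0.10)
ambiguous = margin <= cutoff
print('{} of {} assignments flagged as ambiguous'.format(ambiguous.sum(), len(margin)))
```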

Load and wrangle data¶

  • This time we load the information for all seeds.
    • Parents: 3,121 seeds
    • $F_{18}$: 27,934 seeds
    • $F_{58}$: 6,826 seeds
In [1]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
%matplotlib inline

from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
In [2]:
src = '../shape_descriptors/'
dst = '../umap_unsupervised_results/'

gen = 0
T = 16
d = 158
umap_params = {'n_neighbors':50, 'min_dist':0.1, 'n_components':12, 'metric':'manhattan'}
info_type = 'topounscaled'

filename = src + 'combined_d{}_T{}.csv'.format(d,T)
data = pd.read_csv(filename)
tdata = data.iloc[:, :20]
edata = data.iloc[:, :8].join(data.iloc[:,19:])
/tmp/ipykernel_51073/3557317365.py:11: DtypeWarning: Columns (6) have mixed types. Specify dtype option on import or set low_memory=False.
  data = pd.read_csv(filename)

First, let's recap the parents (gen = 0)¶

  • This is a recap of what is already published.

Visualize the distribution of the descriptors¶

Traditional descriptors¶

  • We perform a 2-dim PCA of the 11 traditional descriptors after scaling them.
  • We observe that traditional descriptors tend to cluster panicles of the same accession.
  • Although there is no clear separation between different accession clusters.
Remember: PCA, as well as non-linear kernel PCA (KPCA), reduces dimensions by projecting the data along the orthogonal directions that capture the most variance.
  • This means that for a fixed dataset, the PC1 and PC2 values are always the same (except for a ± sign) regardless of whether we reduce the data to 2, to 12, or to 100 dimensions.
  • This is an important distinction compared to UMAP; it will be relevant later.
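A quick numerical check of this PCA property, on synthetic data standing in for the 11 scaled traditional descriptors: the first two principal components agree, up to sign, whether we ask for 2 or for 8 components.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic stand-in for the 11 traditional descriptors (correlated columns).
X = rng.normal(size=(500, 11)) @ rng.normal(size=(11, 11))
Xs = StandardScaler().fit_transform(X)

pc2 = PCA(n_components=2, svd_solver='full').fit_transform(Xs)
pc8 = PCA(n_components=8, svd_solver='full').fit_transform(Xs)

# Compare absolute values so that a possible sign flip does not matter.
same = np.allclose(np.abs(pc2), np.abs(pc8[:, :2]))
print('First two PCs identical up to sign:', same)
```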
In [3]:
trad = tdata[tdata['Generation'] == gen].reset_index(drop=True)
foo = np.unique(trad['Label (C-G-S-P)'])
mdict = dict(zip(foo,np.arange(len(foo))))
vmax = len(mdict)

tvals = trad['Label (C-G-S-P)'].map(mdict).values

founders_names_original = np.unique(trad.Founder.values)
founders_names = founders_names_original.copy()
founders_names[4] = 'CA Mariout'
founders_names[10] = 'Good Delta'
founders_names[16] = 'Maison Carree'
founders_names[23] = 'Palmella Blue'
founders_names[27] = 'WI Winter'

tt = trad.iloc[:, 8:19].values
g0scaler = StandardScaler().fit(tt)

dims = 2
pca = PCA(n_components=dims, svd_solver='full').fit(g0scaler.transform(tt))
pcat = pca.transform(g0scaler.transform(tt))
In [4]:
cmap = 'viridis'
fs = 15

fig, ax = plt.subplots(1, 1, figsize=(7,4), sharex=True, sharey=True, facecolor='snow')
ax = np.atleast_1d(ax).ravel(); k = 0

ax[k].scatter(pcat[:,0], pcat[:,1], c=tvals, cmap=cmap, vmin=0, vmax=vmax)
ax[k].set_aspect('equal')

ax[k].set_xlabel('PCA 1 ({:.2f}%)'.format(np.round(100*pca.explained_variance_ratio_[0],1)), fontsize=fs)
ax[k].set_ylabel('PCA 2 ({:.2f}%)'.format(np.round(100*pca.explained_variance_ratio_[1],1)), fontsize=fs)
ax[k].tick_params(labelsize=fs-3)

ax[k].set_title('Traditional shape descriptors after being PCA reduced to {} dims (Gen {})'.format(dims, gen), fontsize=fs)

fig.tight_layout();
[Figure: PC 1 vs PC 2 of the traditional shape descriptors, Gen 0, all seeds in one panel]

Above: different colors correspond to different panicles¶

In [5]:
fig, ax = plt.subplots(4, 7, figsize=(18, 6.5), sharex=True, sharey=True, facecolor='snow')
ax = np.atleast_1d(ax).ravel()

for i in range(len(founders_names)):
    line = founders_names_original[i]
    accession = trad[trad.Founder == line]
    
    foo = np.unique(accession['Label (C-G-S-P)'])
    vals = accession['Label (C-G-S-P)'].map(dict(zip(foo, np.arange(len(foo)))))

    mask = accession.index
    nask = np.setdiff1d(np.arange(len(trad)), mask)
        
    ax[i].scatter(pcat[nask,0], pcat[nask,1], c='darkgrey', marker='.', s=25, zorder=1)
    ax[i].scatter(pcat[mask,0], pcat[mask,1], c=vals, cmap=cmap, marker='.', s=50, vmin=0, vmax=2, zorder=2)
    
    ax[i].set_title(founders_names[i], fontsize=fs)

for i in range(len(ax)):
    ax[i].tick_params(labelbottom=False, labelleft=False)
    ax[i].set_aspect('equal')
    ax[i].set_facecolor('whitesmoke')
    
fig.supxlabel('PC 1 ({:.2f}%)'.format(np.round(100*pca.explained_variance_ratio_[0],1)), fontsize=fs)
fig.supylabel('PC 2 ({:.2f}%)'.format(np.round(100*pca.explained_variance_ratio_[1],1)), fontsize=fs)

fig.suptitle('Traditional shape descriptors after being PCA reduced to {} dims (Gen {})'.format(dims, gen), fontsize=fs)
fig.tight_layout()

filename = dst + 'gen{}_traditional_d{}_T{}_{}.jpg'.format(gen, d,T, info_type)
print(filename)
#plt.savefig(filename, bbox_inches='tight', dpi=100, format='jpg', pil_kwargs={'optimize':True})
../umap_unsupervised_results/gen0_traditional_d158_T16_topounscaled.jpg
[Figure: 4×7 grid, one panel per founder; PC 1 vs PC 2 of the traditional shape descriptors, Gen 0]

Above: different colors correspond to different panicles. Colors are unrelated between plots.¶

To emphasize: Notice that PC1 and PC2 are exactly the same if we decide to reduce the traditional shape descriptors to 8 dimensions instead.
In [6]:
dims = 8
pca = PCA(n_components=dims, svd_solver='full').fit(g0scaler.transform(tt))
pcat = pca.transform(g0scaler.transform(tt))

fig, ax = plt.subplots(4, 7, figsize=(18, 6.5), sharex=True, sharey=True, facecolor='snow')
ax = np.atleast_1d(ax).ravel()

for i in range(len(founders_names)):
    line = founders_names_original[i]
    accession = trad[trad.Founder == line]
    
    foo = np.unique(accession['Label (C-G-S-P)'])
    vals = accession['Label (C-G-S-P)'].map(dict(zip(foo, np.arange(len(foo)))))

    mask = accession.index
    nask = np.setdiff1d(np.arange(len(trad)), mask)
        
    ax[i].scatter(pcat[nask,0], pcat[nask,1], c='darkgrey', marker='.', s=25, zorder=1)
    ax[i].scatter(pcat[mask,0], pcat[mask,1], c=vals, cmap=cmap, marker='.', s=50, vmin=0, vmax=2, zorder=2)
    
    ax[i].set_title(founders_names[i], fontsize=fs)

for i in range(len(ax)):
    ax[i].tick_params(labelbottom=False, labelleft=False)
    ax[i].set_aspect('equal')
    ax[i].set_facecolor('whitesmoke')
    
fig.supxlabel('PC 1 ({:.2f}%)'.format(np.round(100*pca.explained_variance_ratio_[0],1)), fontsize=fs)
fig.supylabel('PC 2 ({:.2f}%)'.format(np.round(100*pca.explained_variance_ratio_[1],1)), fontsize=fs)

fig.suptitle('Traditional shape descriptors after being PCA reduced to {} dims (Gen {})'.format(dims, gen), fontsize=fs)
fig.tight_layout()

filename = dst + 'gen{}_traditional_d{}_T{}_{}.jpg'.format(gen, d,T, info_type)
print(filename)
../umap_unsupervised_results/gen0_traditional_d158_T16_topounscaled.jpg
[Figure: same 4×7 founder grid after reducing the traditional shape descriptors to 8 PCA dimensions; PC 1 vs PC 2, Gen 0]

Compare the above plot with one for the first two UMAP components of the topological descriptor¶

Remember: The number of dimensions we reduce to matters when using UMAP

A word on UMAP¶

  • UMAP (Uniform Manifold Approximation and Projection for Dimension Reduction) is a dimension-reduction heuristic that consists of two main steps.
  • First: Through some sophisticated math, it approximates the high-dimensional data with a collection of vertices and edges (the math word is a 1-dimensional simplicial complex).
  • This simplicial complex is constructed in the same high-dimensional space.
  • This complex preserves important topological information of the high-dimensional data — in this case, connectivity.

  • Second: It brings down the high-dimensional simplicial complex to a low-dimensional space while trying its best to still keep the original connectivity. Like a 2-D or 12-D space.
  • Here is where things get tricky.
  • Depending on how many dimensions you have (2 vs 12), the algorithm will have extra space to work around to fit the simplicial complex.
  • Think of trying to shoehorn a cube into a line vs shoehorn it into a plane.
  • That's why UMAP 1 and UMAP 2 will be different depending on how many dimensions you had available when shoehorning at first.
  • That being said, a broad phenomenon observed when reducing to 2 dimensions should still hold when reducing to 12 (the converse does not hold).
  • As far as I know, unlike PCA, there is no way to tell how much variance is encoded by each UMAP component or how many UMAP components are necessary to correctly reduce the data.

  • In our case, we observe that when aggressively reducing to 2 dimensions, UMAP tends to form clusters based on individual panicles rather than individual accessions.
  • When looking at UMAP1 and UMAP2 after reducing to 12 dimensions instead, we see that individual panicle clusters are not as well defined.
  • However, keep in mind that these panicle-based clusters could well become more obvious if we were able to visualize all 12 dimensions at once.

  • McInnes (the guy who devised the UMAP theory) has a good, brief, friendly, high-level picture lecture on UMAP from SciPy 2018.
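Even without an explained-variance analogue, one can still ask how well an embedding preserves local neighborhoods, for example with scikit-learn's `trustworthiness` score (between 0 and 1, higher is better). The sketch below is only an illustration of that idea and was not part of the original analysis. To keep it runnable without UMAP installed, PCA is used as a stand-in embedding on synthetic data; any `(n_samples, k)` array, such as a UMAP output, could be scored the same way.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import trustworthiness

rng = np.random.default_rng(0)

# Synthetic high-dimensional data (stand-in for the ECT vectors).
X = rng.normal(size=(300, 50)) @ rng.normal(size=(50, 50))

scores = {}
for k in (2, 12):
    # Stand-in embedding; swap in a UMAP embedding of the same data to compare.
    emb = PCA(n_components=k).fit_transform(X)
    scores[k] = trustworthiness(X, emb, n_neighbors=10)
    print('{:2d} dims: trustworthiness = {:.3f}'.format(k, scores[k]))
```

Comparing the score across target dimensions gives a rough, heuristic sense of how much local structure is lost by a more aggressive reduction.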
In [8]:
filename = dst + 'new_umap_gen{}_d{}_T{}_{}_{}_{}_{}.csv'.format(gen, d,T, *umap_params.values())
udata = pd.read_csv(filename)
ump = udata.iloc[:, 10:].values

filename = dst + 'umap_topological_158_16_50_0.1_2_manhattan_unsupervised.csv'
ump2 = pd.read_csv(filename, header=None).values
In [9]:
fig, ax = plt.subplots(1, 2, figsize=(10,5), sharex=False, sharey=False, facecolor='snow')
ax = np.atleast_1d(ax).ravel(); k = 0

for k,u in enumerate([ump2, ump]):
    ax[k].scatter(u[:,0], u[:,1], c=tvals, cmap=cmap, vmin=0, vmax=vmax)
    ax[k].set_xlabel('UMAP 1', fontsize=fs)
    ax[k].tick_params(labelsize=fs-3)
ax[0].set_ylabel('UMAP 2', fontsize=fs)
ax[0].set_title('UMAP-reduced to {} dims (Gen {})'.format(2, gen), fontsize=fs)
ax[1].set_title('UMAP-reduced to {} dims (Gen {})'.format(12, gen), fontsize=fs)
fig.suptitle('All topological features (T={}, d={})'.format(T,d), fontsize=fs)

fig.tight_layout();
[Figure: UMAP 1 vs UMAP 2 of all topological features, Gen 0; left panel reduced to 2 dims, right panel reduced to 12 dims]

Above: different colors correspond to different panicles¶

In [10]:
fig, ax = plt.subplots(8, 7, figsize=(18, 20), sharex=False, sharey=False, facecolor='snow')
axs = ax.shape; ax = np.atleast_1d(ax).ravel()

for i in range(len(founders_names)):
    j = axs[1]*(i//axs[1]) + i
    
    line = founders_names_original[i]
    accession = udata[udata.Founder == line]
    
    foo = np.unique(accession['Label (C-G-S-P)'])
    vals = accession['Label (C-G-S-P)'].map(dict(zip(foo, np.arange(len(foo)))))
    mask = accession.index
    nask = np.setdiff1d(np.arange(len(udata)), mask)
        
    ax[j].scatter(ump2[nask,0], ump2[nask,1], c='darkgrey', marker='.', zorder=1)
    ax[j].scatter(ump2[mask,0], ump2[mask,1], c=vals, cmap=cmap, marker='o', vmin=0, vmax=2, zorder=2)
    ax[j+axs[1]].set_title(founders_names[i], fontsize=fs+4)
    ax[j+axs[1]].scatter(ump[nask,0], ump[nask,1], c='darkgrey', marker='.', zorder=1)
    ax[j+axs[1]].scatter(ump[mask,0], ump[mask,1], c=vals, cmap='plasma', marker='o', vmin=0, vmax=2, zorder=2)

for i in range(len(ax)):
    ax[i].tick_params(bottom=False, left=False, labelbottom=False, labelleft=False)
    ax[i].set_facecolor('whitesmoke')
    
fig.supxlabel('UMAP 1', fontsize=fs+4)
fig.supylabel('UMAP 2', fontsize=fs+4)

fig.suptitle('Topological shape descriptors when reducing to 2 (viridis) or to 12 (plasma) dimensions (Gen {})'.format(gen), fontsize=fs+5, y=1)
fig.tight_layout()

filename = dst + 'gen{}_topological_d{}_T{}.jpg'.format(gen, d,T)
print(filename)
#plt.savefig(filename, bbox_inches='tight', dpi=100, format='jpg', pil_kwargs={'optimize':True})
../umap_unsupervised_results/gen0_topological_d158_T16.jpg
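[Figure: 8×7 grid, two panels per founder; UMAP 1 vs UMAP 2 of the topological shape descriptors when reducing to 2 dims (viridis) or to 12 dims (plasma), Gen 0]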

Plot above: differences in UMAP1 and UMAP2 components when reducing all the 2500 topological descriptors to 2 or to 12 dimensions.¶

  • Different colors correspond to different panicles.
  • Colors are uncorrelated for different plots.
  • When reducing to 2 dimensions, colorscheme is viridis (purple, green, yellow)
  • When reducing to 12, colorscheme is plasma (blue, pink, yellow)

  • Again, we observe that UMAP seems to cluster seeds by panicle rather than by accession.
  • This is more observable in the 2-dim case, where panicle clusters are slightly more defined.

  • This observation might not be solely due to UMAP, but the original topological data might be predisposed to favor individual panicle differences over accession differences
  • We observe an even more acute version of this phenomenon when reducing the topological descriptors to 12 dimensions with KPCA instead
  • Note: In KPCA, just like in PCA, the PC1 and PC2 values are the same whether reducing to 2 or to 12 dimensions.

Producing plots with KPCA¶

Back with UMAP and $F_{18}$¶

  • We are going to use the UMAP components after reducing to 12 dimensions, since those were the ones used to make predictions in the first place.
  • Keep in mind that we would need to make plots of all 12 components against each other to get a full sense of how UMAP is treating the topological shape descriptors.
  • But even by just looking at the first 2 components, we already see differences compared to the results we got from the parents.
  • There are no clear panicle-based clusters, for starters
In [14]:
gen = 1

filename = dst + 'new_umap_gen{}_d{}_T{}_{}_{}_{}_{}.csv'.format(gen, d,T, *umap_params.values())
udata = pd.read_csv(filename)
ump = udata.iloc[:, 10:].values
filename = dst + 'gen{}_svm_combined_d{}_T{}_{}.csv'.format(gen, d,T, info_type)
ldata = pd.read_csv(filename)

foo = np.unique(ldata['Label (C-G-S-P)'])
mdict = dict(zip(foo,np.arange(len(foo))))
vmax = len(mdict)
tvals = ldata['Label (C-G-S-P)'].map(mdict).values
In [15]:
fig, ax = plt.subplots(1, 1, figsize=(5,5), sharex=True, sharey=True, facecolor='snow')
ax = np.atleast_1d(ax).ravel(); k = 0

ax[k].scatter(ump[:,0], ump[:,1], c=tvals, cmap=cmap, vmin=0, vmax=vmax)
ax[k].set_aspect('equal')

ax[k].set_xlabel('UMAP 1', fontsize=fs)
ax[k].set_ylabel('UMAP 2', fontsize=fs)
ax[k].tick_params(labelsize=fs-3)

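# Use the UMAP setting for the title; `dims` still holds the value from the earlier PCA cell
ax[k].set_title('Topological shape descriptors\nUMAP-reduced to {} dims (Gen {})'.format(umap_params['n_components'], gen), fontsize=fs)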

fig.tight_layout();
#plt.scatter(ump[:,0], ump[:,1])
[Figure: UMAP 1 vs UMAP 2 of the topological shape descriptors, Gen 1, all seeds in one panel]
In [24]:
fig, ax = plt.subplots(4, 7, figsize=(18, 10), sharex=True, sharey=True)
ax = np.atleast_1d(ax).ravel()
for i in range(len(founders_names)):
    line = founders_names_original[i]
    accession = ldata[ldata.Founder == line]
    foo = np.unique(accession['Label (C-G-S-P)'])
    vals = accession['Label (C-G-S-P)'].map(dict(zip(foo, np.arange(len(foo)))))
    mask = accession.index
    nask = np.setdiff1d(np.arange(len(ldata)), mask)
        
    ax[i].scatter(ump[nask,0], ump[nask,1], c='darkgrey', marker='.', s=25, zorder=1)
    ax[i].scatter(ump[mask,0], ump[mask,1], c=vals, cmap=cmap, marker='.', s=50, vmin=0, vmax=np.max([1,len(foo)-1]), zorder=2)
    
    ax[i].set_title(founders_names[i], fontsize=fs)

for i in range(len(ax)):
    ax[i].tick_params(labelbottom=False, labelleft=False)
    
fig.supxlabel('UMAP 1', fontsize=fs+4)
fig.supylabel('UMAP 2', fontsize=fs+4)

fig.suptitle('Generation {}, SVM combined information, {}'.format(gen, info_type), fontsize=fs+7)
fig.tight_layout()

filename = dst + 'gen{}_svm_d{}_T{}_{}.jpg'.format(gen, d,T, info_type)
print(filename)
#plt.savefig(filename, bbox_inches='tight', dpi=100, format='jpg', pil_kwargs={'optimize':True})
../umap_unsupervised_results/gen1_svm_d158_T16_topounscaled.jpg
[Figure: 4×7 grid, one panel per founder; UMAP 1 vs UMAP 2, Generation 1, SVM combined information]

Above: Different colors correspond to different panicles.¶

  • Colors uncorrelated between plots.
  • There are no clear panicle-based clusters.

Same thing but with $F_{58}$¶

In [25]:
gen = 7

filename = dst + 'new_umap_gen{}_d{}_T{}_{}_{}_{}_{}.csv'.format(gen, d,T, *umap_params.values())
udata = pd.read_csv(filename)
ump = udata.iloc[:, 10:].values
filename = dst + 'gen{}_svm_combined_d{}_T{}_{}.csv'.format(gen, d,T, info_type)
ldata = pd.read_csv(filename)

foo = np.unique(ldata['Label (C-G-S-P)'])
mdict = dict(zip(foo,np.arange(len(foo))))
vmax = len(mdict)
tvals = ldata['Label (C-G-S-P)'].map(mdict).values
In [26]:
fig, ax = plt.subplots(1, 1, figsize=(5,5), sharex=True, sharey=True, facecolor='snow')
ax = np.atleast_1d(ax).ravel(); k = 0

ax[k].scatter(ump[:,0], ump[:,1], c=tvals, cmap=cmap, vmin=0, vmax=vmax)
ax[k].set_aspect('equal')

ax[k].set_xlabel('UMAP 1', fontsize=fs)
ax[k].set_ylabel('UMAP 2', fontsize=fs)
ax[k].tick_params(labelsize=fs-3)

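# Use the UMAP setting for the title; `dims` still holds the value from the earlier PCA cell
ax[k].set_title('Topological shape descriptors\nUMAP-reduced to {} dims (Gen {})'.format(umap_params['n_components'], gen), fontsize=fs)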

fig.tight_layout();
#plt.scatter(ump[:,0], ump[:,1])
[Figure: UMAP 1 vs UMAP 2 of the topological shape descriptors, Gen 7, all seeds in one panel]
In [27]:
fig, ax = plt.subplots(4, 7, figsize=(18, 10), sharex=True, sharey=True)
ax = np.atleast_1d(ax).ravel()

for i in range(len(founders_names)):
    line = founders_names_original[i]
    accession = ldata[ldata.Founder == line]
    foo = np.unique(accession['Label (C-G-S-P)'])
    vals = accession['Label (C-G-S-P)'].map(dict(zip(foo, np.arange(len(foo)))))
    mask = accession.index
    nask = np.setdiff1d(np.arange(len(ldata)), mask)
        
    ax[i].scatter(ump[nask,0], ump[nask,1], c='darkgrey', marker='.', s=25, zorder=1)
    ax[i].scatter(ump[mask,0], ump[mask,1], c=vals, cmap=cmap, marker='.', s=50, vmin=0, vmax=np.max([1,len(foo)-1]), zorder=2)
    
    ax[i].set_title(founders_names[i], fontsize=fs)

for i in range(len(ax)):
    ax[i].tick_params(labelbottom=False, labelleft=False)
    
fig.supxlabel('UMAP 1', fontsize=fs+4)
fig.supylabel('UMAP 2', fontsize=fs+4)

fig.suptitle('Generation {}, SVM combined information, {}'.format(gen, info_type), fontsize=fs+7)
fig.tight_layout()

filename = dst + 'gen{}_svm_d{}_T{}_{}.jpg'.format(gen, d,T, info_type)
print(filename)
#plt.savefig(filename, bbox_inches='tight', dpi=100, format='jpg', pil_kwargs={'optimize':True})
../umap_unsupervised_results/gen7_svm_d158_T16_topounscaled.jpg
[Figure: 4×7 grid, one panel per founder; UMAP 1 vs UMAP 2, Generation 7, SVM combined information]

A word on computational reproducibility¶

  • I don't think this is relevant for the purposes of this report, but it might be important at some point.
  • UMAP is a stochastic heuristic: runs of the same code and data can yield different results (not just flip signs, like PCA).
  • According to the documentation, the python implementation should produce results that qualitatively vary little.
  • I originally used an old version of UMAP where you can set a fixed random seed number to reproduce results.
    • However, I did not do that.
    • Also, I accidentally deleted the files where I had stored the UMAP values I obtained for all the possible combinations of parameters other than the one I originally sent you (ECT for 158 directions, 16 thresholds, reduced to 12 dimensions).
    • I can run all the code again, but the exact UMAP values will be different
  • There is a more recent version of UMAP out there.
  • The most current version (v0.5 at the moment) is even more stochastic.
    • It exploits processor parallelization, which means it runs faster, but multi-threaded runs can no longer be fully reproduced
    • Again, in theory, the results should be qualitatively the same despite being numerically different.
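One hedged way to put a number on "qualitatively the same" is to check how many nearest neighbors each seed keeps between two embeddings. The helper names below (`knn_indices`, `knn_overlap`) are mine, and the embeddings are synthetic: a rotated, slightly noisy copy plays the role of a second run with the same structure, and an unrelated random cloud provides the contrast. This is a sketch of the idea, not something that was run on the actual UMAP outputs.

```python
import numpy as np

def knn_indices(points, k):
    """Indices of the k nearest neighbors of every row (self excluded)."""
    sq = np.sum(points**2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2 * points @ points.T   # squared distances
    np.fill_diagonal(d2, np.inf)                              # ignore self-matches
    return np.argsort(d2, axis=1)[:, :k]

def knn_overlap(emb_a, emb_b, k=15):
    """Mean fraction of k-nearest neighbors shared by two embeddings."""
    na, nb = knn_indices(emb_a, k), knn_indices(emb_b, k)
    shared = [len(np.intersect1d(a, b)) for a, b in zip(na, nb)]
    return np.mean(shared) / k

rng = np.random.default_rng(0)
run_a = rng.normal(size=(200, 12))            # stand-in for one 12-dim UMAP run

# Rotated/reflected copy with a little noise: different numbers, same structure.
q, _ = np.linalg.qr(rng.normal(size=(12, 12)))
run_b = run_a @ q + rng.normal(scale=0.01, size=run_a.shape)

run_c = rng.normal(size=(200, 12))            # unrelated embedding for contrast

print('similar runs  :', round(knn_overlap(run_a, run_b), 2))
print('unrelated runs:', round(knn_overlap(run_a, run_c), 2))
```

An overlap close to 1 between two real runs would support the claim that the runs differ only numerically; the rows of both embeddings must refer to the same seeds in the same order.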
In [49]:
import umap
gen = 0
ect = edata[edata['Generation'] == gen].reset_index(drop=True)

trad = tdata[tdata['Generation'] == gen].reset_index(drop=True)
foo = np.unique(trad['Label (C-G-S-P)'])
mdict = dict(zip(foo,np.arange(len(foo))))
tvals = trad['Label (C-G-S-P)'].map(mdict).values
In [35]:
dims = 2
umap_params = {'n_neighbors':50, 'min_dist':0.1, 'n_components':dims, 'metric':'manhattan'}
umap_trans = umap.UMAP(**umap_params).fit(ect.iloc[:, 9:].values)
u_founders = umap_trans.transform(ect.iloc[:,9:].values)
In [39]:
dims = 12
umap_params = {'n_neighbors':50, 'min_dist':0.1, 'n_components':dims, 'metric':'manhattan'}
umap12_trans = umap.UMAP(**umap_params).fit(ect.iloc[:, 9:].values)
u12_founders = umap12_trans.transform(ect.iloc[:,9:].values)
In [40]:
fig, ax = plt.subplots(1, 1, figsize=(5,5), sharex=True, sharey=True, facecolor='snow')
ax = np.atleast_1d(ax).ravel(); k = 0

ax[k].scatter(u_founders[:,0], u_founders[:,1], c=tvals, cmap=cmap, vmin=0)
ax[k].set_aspect('equal')

ax[k].set_xlabel('UMAP 1', fontsize=fs)
ax[k].set_ylabel('UMAP 2', fontsize=fs)
ax[k].tick_params(labelsize=fs-3)

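# `dims` may have been overwritten by a later cell; read the dimension from the embedding itself
ax[k].set_title('Topological shape descriptors\nUMAP-reduced to {} dims (Gen {})'.format(u_founders.shape[1], gen), fontsize=fs)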

fig.tight_layout();
[Figure: UMAP 1 vs UMAP 2 of the topological shape descriptors, Gen 0, recomputed 2-dim embedding, all seeds in one panel]
In [41]:
fig, ax = plt.subplots(4, 7, figsize=(18, 10), sharex=True, sharey=True, facecolor='snow')
ax = np.atleast_1d(ax).ravel()

for i in range(len(founders_names)):
    line = founders_names_original[i]
    accession = trad[trad.Founder == line]
    
    foo = np.unique(accession['Label (C-G-S-P)'])
    vals = accession['Label (C-G-S-P)'].map(dict(zip(foo, np.arange(len(foo)))))

    mask = accession.index
    nask = np.setdiff1d(np.arange(len(trad)), mask)
        
    ax[i].scatter(u_founders[nask,0], u_founders[nask,1], c='darkgrey', marker='.', s=25, zorder=1)
    ax[i].scatter(u_founders[mask,0], u_founders[mask,1], c=vals, cmap=cmap, marker='.', s=50, vmin=0, vmax=2, zorder=2)
    ax[i].set_title(founders_names[i], fontsize=fs)

for i in range(len(ax)):
    ax[i].tick_params(labelbottom=False, labelleft=False)
    ax[i].set_aspect('equal')
    ax[i].set_facecolor('whitesmoke')
    
fig.supxlabel('UMAP 1', fontsize=fs)
fig.supylabel('UMAP 2', fontsize=fs)

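# This grid shows the UMAP embedding of the topological descriptors, not a PCA of the traditional ones
fig.suptitle('Topological shape descriptors after being UMAP-reduced to {} dims (Gen {})'.format(u_founders.shape[1], gen), fontsize=fs)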
fig.tight_layout()

filename = dst + 'gen{}_traditional_d{}_T{}_{}.jpg'.format(gen, d,T, info_type)
print(filename)
../umap_unsupervised_results/gen0_traditional_d158_T16_topounscaled.jpg
[Figure: 4×7 grid, one panel per founder; UMAP 1 vs UMAP 2 of the recomputed 2-dim embedding, Gen 0]
In [46]:
fig, ax = plt.subplots(1, 2, figsize=(10,5), sharex=False, sharey=False, facecolor='snow')
ax = np.atleast_1d(ax).ravel(); k = 0

for k,u in enumerate([u_founders, u12_founders]):
    ax[k].scatter(u[:,0], u[:,1], c=tvals, cmap=cmap, vmin=0, vmax=len(mdict)-1)
    ax[k].set_xlabel('UMAP 1', fontsize=fs)
    ax[k].set_ylabel('UMAP 2', fontsize=fs)
    ax[k].tick_params(labelsize=fs-3)

ax[0].set_title('UMAP-reduced to {} dims (Gen {})'.format(2, gen), fontsize=fs)
ax[1].set_title('UMAP-reduced to {} dims (Gen {})'.format(12, gen), fontsize=fs)

fig.tight_layout();
[Figure: UMAP 1 vs UMAP 2 of the recomputed embeddings, Gen 0; left panel reduced to 2 dims, right panel reduced to 12 dims]
In [47]:
fig, ax = plt.subplots(4, 7, figsize=(18, 10), sharex=True, sharey=True, facecolor='snow')
ax = np.atleast_1d(ax).ravel()

for i in range(len(founders_names)):
    line = founders_names_original[i]
    accession = trad[trad.Founder == line]
    
    foo = np.unique(accession['Label (C-G-S-P)'])
    vals = accession['Label (C-G-S-P)'].map(dict(zip(foo, np.arange(len(foo)))))

    mask = accession.index
    nask = np.setdiff1d(np.arange(len(trad)), mask)
        
    ax[i].scatter(u12_founders[nask,0], u12_founders[nask,1], c='darkgrey', marker='.', s=25, zorder=1)
    ax[i].scatter(u12_founders[mask,0], u12_founders[mask,1], c=vals, cmap=cmap, marker='.', s=50, vmin=0, vmax=2, zorder=2)
    ax[i].set_title(founders_names[i], fontsize=fs)

for i in range(len(ax)):
    ax[i].tick_params(labelbottom=False, labelleft=False)
    ax[i].set_aspect('equal')
    ax[i].set_facecolor('whitesmoke')
    
fig.supxlabel('UMAP 1', fontsize=fs)
fig.supylabel('UMAP 2', fontsize=fs)

fig.suptitle('Topological shape descriptors after being UMAP-reduced to {} dims (Gen {})'.format(dims, gen), fontsize=fs)
fig.tight_layout()

filename = dst + 'gen{}_traditional_d{}_T{}_{}.jpg'.format(gen, d,T, info_type)
print(filename)
../umap_unsupervised_results/gen0_traditional_d158_T16_topounscaled.jpg
[Figure: 4×7 grid, one panel per founder; UMAP 1 vs UMAP 2 of the recomputed 12-dim embedding, Gen 0]
In [54]:
filename = dst + 'new_umap_gen{}_d{}_T{}_{}_{}_{}_{}.csv'.format(gen, d,T, *umap_params.values())
udata = pd.read_csv(filename)
ump = udata.iloc[:, 10:].values

filename = dst + 'umap_topological_158_16_50_0.1_2_manhattan_unsupervised.csv'
ump2 = pd.read_csv(filename, header=None).values
In [60]:
fig, ax = plt.subplots(2, 2, figsize=(9,9), sharex=False, sharey=False, facecolor='snow')
ax = np.atleast_1d(ax).ravel(); k = 0

for k,u in enumerate([ump2, ump, u_founders, u12_founders]):
    ax[k].scatter(u[:,0], u[:,1], c=tvals, cmap=cmap, vmin=0, vmax=len(mdict)-1)
    ax[k].set_aspect('equal', 'datalim')
    ax[k].tick_params(labelsize=fs)

for k in [0,2]:
    ax[k].set_ylabel('UMAP 2', fontsize=fs)
    ax[2+k//2].set_xlabel('UMAP 1', fontsize=fs)

ax[0].set_title('UMAP-reduced to {} dims (Gen {})\nOld UMAP version'.format(2, gen), fontsize=fs)
ax[1].set_title('UMAP-reduced to {} dims (Gen {})\nOld UMAP version'.format(12, gen), fontsize=fs)
ax[2].set_title('New UMAP version', fontsize=fs)
ax[3].set_title('New UMAP version', fontsize=fs)

fig.tight_layout();
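[Figure: 2×2 grid of UMAP 1 vs UMAP 2, Gen 0; top row old UMAP version, bottom row new UMAP version; left column reduced to 2 dims, right column reduced to 12 dims]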